1: Introduction

This midterm project aims to build a more accurate and generalizable house price prediction model for San Francisco, California, for Zillow. House prices are a constant concern not only for property owners and renters but also for the government, which relies on them to provide public services such as tax assessment. An accurate and generalizable house price prediction model would give people who are buying or renting more reliable information. It would also help the government serve residents more equitably.

Building a good model is nevertheless a challenge. First, no one really knows the true value of a property. People decide to pay a certain amount for a house for many reasons, which are often hard to name, let alone quantify. Second, the variables collected for the model depend to some degree on the modeler’s judgment, and it is difficult to find the most informative variables among the large amount of data available online. Third, linear regression might not be an effective way to predict house prices because the relationship between house prices and the independent variables is complicated. Last but not least, San Francisco is a global city with a great deal of diversity within each neighborhood, which makes a good prediction even harder.

The overall modeling strategy is to use a hedonic model that deconstructs house prices into a group of physical characteristics, such as the number of bedrooms and the property area, and a group of place-based characteristics, such as the average distance to the two nearest crimes and the distance to the airport or the nearest park. The model also includes features that describe the spatial structure of the area in which each house is located.

To summarize, the model produced in this project is effective and generalizes well. It performs well for houses with mid-range prices. For houses with very high prices, the error is likely to be larger. In general, the model could be used as a reference for Zillow.

2: Data Setup

2.1: Briefly describe the methods for gathering the data

For the internal characteristics of the houses, such as the number of bedrooms or the property area, the data comes directly from the Zillow dataset. For the amenities and public services to which the houses are exposed, most of the initial data comes from the Open Data San Francisco website. Some feature engineering was applied when transforming the data. For instance, to measure each house’s access to schools, the distance to the nearest school is measured. Other data is likewise transformed into measures of exposure to crime and homeless concerns, and into distances to highways, parks and so on. The spatial structure variables are collected partly from the ACS and partly from Open Data San Francisco.

2.2: A table of summary statistics with variable description

This section describes all the variables that we collected. Some variables might be excluded from the model after examining the correlation matrix. SalePrice is the dependent variable of the model. In the current observations it ranges from $100,001 to $4,750,003.

I. Internal characteristics

For internal characteristics, the property area and lot area are used as continuous variables. The year built is transformed into a binary variable, where 1 represents properties built before 1938 or after 1965 and 0 represents properties built between 1938 and 1965.

The number of bedrooms, the number of bathrooms and the number of stories are transformed into categorical data. Bedroom categories are 1 Bed, 2 Beds, 3-5 Beds and 6+ Beds. Bathroom categories are 1 Bath, 2 Baths, 3-5 Baths and 6+ Baths. Story categories are 1 Floor, Up to 3 Floors, 4 Floors and 4+ Floors.
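
The recoding described above can be sketched as follows. This is a minimal illustration in Python, not the code used for the report. The function names are hypothetical, and two details are assumptions that the text does not state: homes with no bedrooms are placed in the 1 Bed group, and the years 1938 and 1965 are treated as inclusive bounds.

```python
def bedroom_category(n_beds):
    """Map a bedroom count to one of the four bedroom categories."""
    if n_beds <= 1:
        return "1 Bed"      # assumption: studios (0 beds) are grouped here
    if n_beds == 2:
        return "2 Beds"
    if n_beds <= 5:
        return "3-5 Beds"
    return "6+ Beds"


def built_year_flag(year):
    """Return 0 for homes built 1938-1965 (assumed inclusive), 1 otherwise."""
    return 0 if 1938 <= year <= 1965 else 1


# Small usage example with made-up homes (bedrooms, year built).
homes = [(1, 1925), (4, 1950), (7, 1990)]
recoded = [(bedroom_category(b), built_year_flag(y)) for b, y in homes]
```

The bathroom and story variables could be recoded in the same way with their own category boundaries.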

II. Amenities & public services

For amenities and public services, there are continuous variables as below: the distance to the nearest homeless concern report, the average distance to the two nearest crimes, the average distance to the three nearest fire incidents, the distance to the nearest school, the distance to the nearest park, the distance to the nearest art building, the distance to the nearest hospital, the average distance to the three nearest retail stores, the average distance to the three nearest restaurants, the average distance to the five nearest bus stations, the distance to the nearest BART station, the average distance to the five nearest evictions, the distance to highway, the distance to arterial roads, the distance to historic districts.

There is one binary variable, the airport buffer, in which 1 represents properties within 10 miles of the airport and 0 represents properties beyond 10 miles. The 10-mile distance is based on commonly accepted knowledge about how noise and air pollution vary with distance to the airport.

III. Spatial structure

The spatial structure variables contain the following continuous variables: the area of tree canopy, population density, percentage of the population with a bachelor’s degree, median income, percentage of households and percentage of vacancy. There are also two binary variables. In the slope variable (“great20”), 0 represents a slope of less than 20 degrees and 1 represents a slope greater than 20 degrees. In the race variable (“MajorityWhite”), 0 represents a white population of less than 50% and 1 represents a white population of more than 50%.

## 
## ===============================================================================================
## Statistic           N       Mean       St. Dev.     Min    Pctl(25)    Pctl(75)        Max     
## -----------------------------------------------------------------------------------------------
## SalePrice         9,403 1,145,288.000 701,138.400 100,001  695,001.5   1,380,003    4,750,003  
## Homeless_nn1      9,403   1,943.167    1,445.771  21.965    849.615    2,654.056    9,981.013  
## Crime_nn2         9,403    295.042      115.153   62.671    208.126     353.615     1,009.099  
## Fire_nn3          9,403   2,331.342    1,225.599  171.914  1,498.301   2,801.456    7,700.580  
## School_nn1        9,403    907.436      473.959   24.227    545.918    1,201.668    3,320.439  
## Park_nn1          9,403   1,292.545     618.394   43.515    818.260    1,710.708    4,891.932  
## Airports_Buffer   9,403     0.828        0.377       0         1           1            1      
## ArtBuilding_nn1   9,403   2,484.204    1,248.125  39.529   1,566.045   3,229.651    7,121.239  
## Hospital_nn1      9,403   6,935.752    4,102.276  122.779  3,375.781  10,517.740   15,971.210  
## Retail_nn3        9,403    533.504      260.655   62.043    343.275     677.475     3,079.878  
## Food_nn3          9,403    736.380      392.323   69.011    439.233     958.958     2,781.677  
## Bus_nn10          9,403    508.458      226.660   26.269    338.397     641.557     1,597.109  
## BART_nn1          9,403   8,674.266    5,769.163  268.375  4,083.046  12,113.660   26,623.030  
## distance_highway1 9,403   7,134.042    4,938.319  40.866   3,120.113  10,157.950   21,661.440  
## Evictions_nn5     9,403    565.146      290.328   149.573   376.773     661.515     2,944.874  
## great20           9,403     0.217        0.412       0         0           0            1      
## distance_Arterial 9,403    710.261      533.907   35.469    278.583    1,016.486    2,970.281  
## distance_Historic 9,403  15,995.690    6,474.052  75.346  10,695.400  21,380.120   30,141.130  
## LotArea           9,403  279,746.500  102,306.200 18.000  237,400.000 300,000.000 1,890,500.000
## PropArea          9,403   1,642.193     703.976     187      1,160       1,978        7,679    
## BYear             9,403     0.333        0.471       0         0           1            1      
## Tree              9,403   7,180.351   34,091.010   0.000   1,632.773   6,833.962   818,378.200 
## Med_Income        9,403  110,471.800  33,062.720     0      82,734      137,969      195,375   
## Pct_bachelor      9,403     0.545        0.202     0.000     0.382       0.711        0.892    
## Pct_hhold         9,403     0.602        0.153     0.000     0.484       0.722        0.863    
## Pct_vacancy       9,403     0.058        0.034     0.000     0.037       0.080        0.236    
## Pop_den           9,403  22,927.060    8,964.998   0.000  17,032.230  27,687.410   108,640.300 
## MajorityWhite     9,403     0.438        0.496       0         0           1            1      
## -----------------------------------------------------------------------------------------------

2.3: Correlation Matrix

According to the correlation matrix, most variables are not collinear. The distance to the nearest hospital (Hospital_nn1) has some correlation with the distance to the historic district. Since these two variables are not theoretically related, both are included in the model. The percentage of the population with a bachelor’s degree is correlated with the distance to the nearest hospital, the distance to the historic district, median income and the percentage of households, so it is excluded from the model.

2.4: Home Price Correlation Scatterplot

The four factors of interest selected here are the average distance to the two nearest crimes, the average distance to the three nearest restaurants, median income and property area. According to the scatterplots, the house price rises as the distance to the two nearest crimes increases. As the distance to the three nearest restaurants increases, the house price stays stable with a slight tendency to decline. The house price rises with the median income of the census tract in which the house is located, and it also rises with the property area.

2.5: Mapping the dependent variable

The sale price map clearly shows that the highest house prices cluster around the center of San Francisco, in neighborhoods such as Corona Heights, Dolores Heights and Sherwood Forest, and in the north of the city next to the Presidio and the sea, in neighborhoods such as Presidio Heights and Marina. The lowest house prices cluster in the south and the west of San Francisco.

2.6: Independent Variable Maps

I. Average distance to 2 nearest crime locations

I.1 Interesting Finding: Type of Crime

The chart below shows the categories of crime incidents that occurred in San Francisco from 2012 to 2015. From all the categories, we chose the following to include in our variable: Assault, Burglary, Robbery and Rape.

The map below shows the average distance from each house to its two nearest crime incidents. If we define a neighborhood with a shorter distance to crime as relatively less safe, those neighborhoods are clustered in the northeast and southeast of San Francisco, including Russian Hill, Telegraph Hill and Candlestick Point. The neighborhoods with a longer distance to crime, which could be defined as safer, lie along the west coast and more generally in the west of San Francisco. Judging from this map and the sale price map above, a longer distance to crime incidents might not have much effect on house price, but a shorter distance to crime is likely to have a negative effect.

II. Median Household Income

Median income is shown here in six non-overlapping levels. According to the map, the center of San Francisco and some parts of the north have relatively higher median income levels. Compared with the sale price map, higher income areas generally have higher house prices while lower income areas generally have lower house prices.

III. Average distance to nearest hospital

III.1 Interesting Finding: Type of health care facilities

The following chart shows all the health care facilities in San Francisco. Given their importance to residents, only the general acute care hospitals are kept in the variable.

According to the map below, hospitals in San Francisco are spatially clustered: the center and the northeast are closer to the nearest hospital than the west and the southeast. Compared with the sale price map, the distance to hospital map shows a weaker relationship between house price and distance to hospital, since the areas closest to a hospital contain both high price and low price neighborhoods.

3: Model Building Methods

The general method used in this project is to build a hedonic model, which deconstructs house price into three parts (internal characteristics, amenities/public services and spatial structure) and then connects the house price mathematically to these three parts.

I. Data Wrangling

In this part, data is collected from available sources such as San Francisco Open Data, the ACS and the Berkeley Geo Library. The variables fall into three parts: internal characteristics, amenities/public services and spatial structure. Internal characteristics represent the houses’ internal features, such as the number of bedrooms. Amenities/public services represent what the houses are exposed to in the nearby environment, such as schools, parks or crimes. Spatial structure represents the quality of the area in which the houses are located, such as the slope or the demographic conditions.

II. Feature Engineering

After gathering all the data, some transformation is needed to convert the data into variables that can be related mathematically to house prices. For example, the crime data collected from San Francisco Open Data consists of point locations on the map. To relate them to the houses, the average distance from each house to its nearest crime incident points is measured. Another example is the airport buffer. If a house is within 10 miles of the airport, it gets a value of 1, meaning it is within the buffer. If not, it gets a value of 0. These variables all contribute to the final hedonic model.
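
The two transformations described above can be illustrated with a short sketch. It is written in Python for illustration and is not the code used for the report. The function names and the toy coordinates are hypothetical, and the sketch assumes that all locations are already in a projected coordinate system so that straight-line (Euclidean) distance is meaningful.

```python
import math


def mean_knn_distance(house, points, k):
    """Average Euclidean distance from one house to its k nearest points."""
    distances = sorted(math.dist(house, p) for p in points)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


def within_buffer(house, site, radius):
    """Return 1 if the house lies within `radius` of the site, else 0."""
    return 1 if math.dist(house, site) <= radius else 0


# Toy example: one house, three crime points and one airport location.
house = (0.0, 0.0)
crimes = [(3.0, 4.0), (6.0, 8.0), (30.0, 40.0)]
crime_nn2 = mean_knn_distance(house, crimes, k=2)        # mean of 5 and 10
airport_flag = within_buffer(house, (0.0, 9.0), radius=10.0)
```

Changing `k` gives the other nearest-neighbor variables in the summary table, such as the three nearest fire incidents or the five nearest evictions.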

III. Correlation and Multicollinearity

In this section, the relationship between each pair of variables is tested. If two variables are highly correlated, they carry very similar statistical information, so there is little value in including both of them. Some highly correlated variables are removed after consideration.
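
A pairwise correlation screen of this kind can be sketched as follows. The example is in Python and is only illustrative: the function names, the toy values and the cutoff of 0.8 are assumptions, since the report does not state which threshold was applied.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.8):
    """List variable pairs whose absolute correlation reaches the threshold."""
    names = sorted(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson_r(columns[a], columns[b])
            if abs(r) >= threshold:
                flagged.append((a, b, round(r, 3)))
    return flagged


# Toy data only: the numbers are invented, not taken from the report.
toy = {
    "Med_Income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Pct_bachelor": [2.1, 3.9, 6.2, 8.0, 9.8],
    "Park_nn1": [5.0, 1.0, 4.0, 2.0, 3.0],
}
flagged = correlated_pairs(toy)
```

In the toy data only the income and bachelor’s degree columns are flagged, after which one of the two would be a candidate for removal.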

IV. Regression Model

Using the remaining variables, a linear regression model is built to show the connection between house price and the three variable categories. The model has 25 variables across all the categories. The original model does not include a neighborhood effect. Since some errors are clustered in specific neighborhoods, a neighborhood effect is added to the model to improve its performance. The model explains about 63% of the variation in sale price, with a prediction error of approximately $293,574.7.

V. Cross Validation

To obtain an accurate and generalizable model, cross-validation divides the observations into two sets: training (60% of the data) and test (the remaining 40%). The idea is to fit the model with the current variables on the training set and check whether each randomly chosen test set is explained well by the model. After 100 folds of the test, the model explains around 64% of the variance in sale price.
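
One way to read the procedure above is as a repeated random hold-out: shuffle the data, fit on 60%, score on the other 40%, and repeat. The sketch below illustrates that reading in Python. It is an assumption about the procedure rather than the report’s own code, it uses a single predictor instead of the full set of variables, and the data is synthetic.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def repeated_holdout(xs, ys, n_repeats=100, train_share=0.6, seed=0):
    """Score the model on n_repeats random train/test splits."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    n_train = round(len(indices) * train_share)
    scores = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        train, test = indices[:n_train], indices[n_train:]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        predicted = [intercept + slope * xs[i] for i in test]
        scores.append(r_squared([ys[i] for i in test], predicted))
    return scores


# Synthetic data: price depends on area plus random noise.
data_rng = random.Random(42)
area = [data_rng.uniform(500, 4000) for _ in range(200)]
price = [300 * a + data_rng.gauss(0, 150000) for a in area]
scores = repeated_holdout(area, price)
mean_r2 = sum(scores) / len(scores)
```

A narrow spread of the 100 scores around their mean is what would indicate that the model generalizes across random subsets of the data.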

4: Regression Results and Analysis

4.1: In-Sample Results

4.2: Out-Sample Results

According to the summary of five different tests, the R^2 falls between 0.62 and 0.65, the MAE is between about $293,000 and $295,500, and the MAPE is between about 29% and 31%.

Test   R^2      MAE ($)    MAPE
1      0.6266   294814.6   28.82%
2      0.6465   295499.5   30%
3      0.6362   293152.8   29.47%
4      0.6389   293089.3   29.64%
5      0.6423   293975.2   30.51%
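
For reference, the three metrics reported in this section follow their standard definitions, which can be sketched as below. The example is in Python with hypothetical function names and invented numbers, and it is not the code used to produce the results above.

```python
def mae(actual, predicted):
    """Mean absolute error, in the same units as the prices."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, returned as a proportion."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Share of the variance in the observed prices that is explained."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Invented example: three observed prices and three predictions.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
example_mae = mae(actual, predicted)        # (10 + 20 + 0) / 3
example_mape = mape(actual, predicted)      # (0.10 + 0.10 + 0) / 3
example_r2 = r_squared(actual, predicted)
```

The MAE is expressed in dollars while the MAPE scales each error by the observed price, which is why the two can rank neighborhoods differently.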

4.3: Cross-validation test

I. Cross-validation Results

Across the 100 folds of the cross-validation test, the RMSE is $422,131.5, considerably larger than the MAE of $293,574.7. Because the RMSE penalizes large errors more heavily than the MAE, this gap indicates that the model makes some large prediction errors.

RMSE ($)   R^2         MAE ($)
422131.5   0.6389699   293574.7
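
The reasoning behind comparing the two error measures can be shown with a small worked example. The sketch below is in Python with invented error values. Only the last line uses figures from the table above.

```python
import math


def mae(errors):
    """Mean absolute error of a list of prediction errors."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error of a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Two invented error sets with the same MAE of 100.
even_errors = [100.0, 100.0, 100.0, 100.0]     # all errors equal
skewed_errors = [10.0, 10.0, 10.0, 370.0]      # one very large error

even_gap = rmse(even_errors) - mae(even_errors)        # zero: no large errors
skewed_gap = rmse(skewed_errors) - mae(skewed_errors)  # positive: one outlier

# Ratio of the reported cross-validation RMSE to the reported MAE.
reported_ratio = 422131.5 / 293574.7
```

When every error has the same size the two measures coincide, so a reported ratio of about 1.44 is consistent with a minority of sales that are predicted poorly.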

II. Histogram of the cross-validation MAE

The histogram of the cross-validation MAE shows that the errors fall mostly between $260,000 and $320,000.

4.4: Predicted Prices VS Observed Prices

As shown in the plot below, the black points are the observed sale prices in the dataset, while the pale green line represents the perfect prediction, where the predicted value equals the observed value. The blue line is the fit of the model’s actual predictions. When the sale price is around $1,000,000, the model predicts with relatively small residuals. When the sale price is below $1,000,000, the model tends to overpredict the house price. When the sale price is above $1,000,000, the model tends to underpredict the house price. Because some houses have very high prices, the model performs worse on higher-value houses.

4.5: Residuals for 40% selected test set

As noted in the previous section, the largest residuals mostly appear where house prices are very high, namely in the center and the north of San Francisco. Where house prices are relatively low, the residuals are relatively small.

4.6: Moran’s I test

The Moran’s I computed for the test set with this model is close to zero, which indicates spatial randomness. The observed Moran’s I is shown in red, and it is not higher than all of the 999 randomly generated permutations. Thus, the Moran’s I indicates that the model includes features that capture much of the spatial structure of house prices, although it may still miss some factors.
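
The permutation logic behind this test can be sketched as follows. The example is in Python and is an illustration, not the report’s code: the function names are hypothetical, the weights are a simple adjacency scheme on six invented locations, and the values are made up to show a clearly clustered pattern.

```python
import random


def morans_i(values, weights):
    """Moran's I; `weights` maps an (i, j) index pair to a spatial weight."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(weights.values())
    numerator = sum(w * dev[i] * dev[j] for (i, j), w in weights.items())
    denominator = sum(d * d for d in dev)
    return (n / w_sum) * (numerator / denominator)


def permutation_test(values, weights, n_perm=999, seed=0):
    """Compare the observed I with I computed on shuffled values."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, weights) >= observed:
            at_least_as_large += 1
    pseudo_p = (at_least_as_large + 1) / (n_perm + 1)
    return observed, pseudo_p


# Six invented locations in a row; each one neighbors the next.
weights = {}
for i in range(5):
    weights[(i, i + 1)] = 1.0
    weights[(i + 1, i)] = 1.0

clustered = [1.0, 1.0, 1.0, 9.0, 9.0, 9.0]   # low values next to low values
observed, pseudo_p = permutation_test(clustered, weights)
```

A value near zero with a large pseudo p-value would correspond to the spatially random pattern described above, whereas the clustered toy values give a clearly positive statistic.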

4.7: Predicted Prices for entire dataset

The map below shows the house prices predicted by the model created in this project. The general spatial trends match the known house price trends. The highest prices appear around the center and the north of San Francisco, and the lower prices appear on the west and south sides of the city. However, a closer look at the map shows that the price variance inside each neighborhood is very small. This is due to the scale of the data, since many variables are measured at the neighborhood or census tract level. The extreme case is Hunters Point, which has a MAPE of 2.24 and only around five house price data points.

4.8: MAPE by Neighborhood

The map below shows the MAPE by neighborhood in San Francisco. The west of San Francisco generally has a lower MAPE than the east of the city. Some neighborhoods with high house prices tend to have a higher MAPE. Other neighborhoods in the south with a high MAPE have very few sale price observations, so their estimates are unreliable.
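
The per-neighborhood summary behind such a map can be sketched as a simple group-by. The example is in Python, with hypothetical neighborhood labels and invented sales. It also returns the number of sales per neighborhood, because a MAPE based on only a few observations should be read with caution.

```python
from collections import defaultdict


def mape_by_group(records):
    """Return {group: (MAPE as a proportion, number of sales)}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, actual, predicted in records:
        totals[group] += abs(actual - predicted) / actual
        counts[group] += 1
    return {g: (totals[g] / counts[g], counts[g]) for g in totals}


# Invented sales: (neighborhood, observed price, predicted price).
sales = [
    ("A", 1000000.0, 900000.0),
    ("A", 500000.0, 600000.0),
    ("B", 200000.0, 650000.0),   # a single, badly predicted sale
]
summary = mape_by_group(sales)
```

In the toy data neighborhood B has a MAPE above 2 from a single sale, which mirrors how a sparsely sampled neighborhood can dominate the map.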

4.9: Scatterplot of MAPE by neighborhood as a function of mean price by neighborhood

According to the scatterplot below, the MAPE remains relatively stable as the mean sale price varies, with a slight tendency to decline as the mean sale price goes up. The scatterplot indicates that the model has a stable goodness of fit among neighborhoods with a mean price between $1,000,000 and $2,000,000, regardless of the price variation. In other words, the model generalizes well across neighborhoods with different house prices.

4.10: Generalizability test of two racial groups with tidycensus

With ACS racial data obtained through tidycensus, the racial context is mapped below in two categories: majority white and majority non-white. As stated above, the center and the north of San Francisco generally have higher house prices. In the racial context map, these areas mostly have a majority white population.

The table below shows the error for majority white and majority non-white neighborhoods. The values appear to be proportions of the sale price, comparable to the MAPE of about 29% reported above, rather than dollar amounts. The error for majority non-white neighborhoods is about 0.02 lower than for majority white neighborhoods (0.284 versus 0.302). The model is therefore slightly less accurate in majority white neighborhoods, but the small difference suggests that it generalizes reasonably well across both racial contexts.

Mean Absolute Error of test set sales by neighborhood racial context

Majority Non-White   Majority White
0.2839602            0.3018766

5: Discussion

In general, the model is effective: its R-squared of 0.64 means that it explains about 64% of the variation in home sale prices. The model is also generalizable and can be used to predict sale prices across neighborhoods that differ in racial composition and income. The interesting variables used here are the distance to the two nearest crime incidents, median income and the distance to the nearest hospital. The distance to the nearest hospital stands out as a predictor because its p-value is less than 0.001, which means it is highly statistically significant. The Moran’s I test shows residuals that are close to spatially random, while the MAPE map by neighborhood still shows some differences between areas. In other words, the model accounts for most, but not all, of the spatial variation in prices.

To check the generalizability of the model, we ran a cross-validation test, and the standard deviation of the MAE is small enough to indicate that the model is generalizable. Overall, the model predicts particularly well outside the areas where house prices are extremely high. There are several possible reasons for the poorer performance in the high price areas. First, the data wrangling and feature engineering are relatively subjective. Second, the accessible data might not be representative enough for house price prediction. Lastly, the linear regression model might not be an effective option for houses with extremely high prices.

6: Conclusion

We would recommend this model to Zillow because it generalizes well. To keep the model usable over time, the data needs to be updated regularly. Moreover, to apply the model to other cities, we would need to consider the special characteristics of each city and add data that represents them so that the model performs better.